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The NEXT Generation Health study investigates the dating vi¬ 
olence of adolescents using a survey questionnaire. Each student is 
asked to affirm or deny multiple instances of violence in his/her dat¬ 
ing relationship. There is, however, evidence suggesting that students 
not in a relationship responded to the survey, resulting in excessive 
zeros in the responses. This paper proposes likelihood-based and es¬ 
timating equation approaches to analyze the zero-inflated clustered 
binary response data. We adopt a mixed model method to account 
for the cluster effect, and the model parameters are estimated us¬ 
ing a maximum-likelihood (ML) approach that requires a Gaussian- 
Hermite quadrature (GHQ) approximation for implementation. Since 
an incorrect assumption on the random effects distribution may bias 
the results, we construct generalized estimating equations (GEE) that 
do not require the correct specification of within-cluster correlation. 

In a series of simulation studies, we examine the performance of ML 
and GEE methods in terms of their bias, efficiency and robustness. 

We illustrate the importance of properly accounting for this zero in¬ 
flation by reanalyzing the NEXT data where this issue has previously 
been ignored. 
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1. Introduction. In public health studies, clustered or longitudinal bi¬ 
nary responses may be collected on a group of individuals where only a 
subset of these individuals are susceptible to having a positive response. For 
example, questionnaires may ask teenagers who are dating to answer a se¬ 
ries of questions about dating violence. As in the NEXT Generation Health 
Study, a larger proportion of all zero responses are observed than would oc¬ 
cur by chance; presumably many individuals who are not dating filled in all 
zeros on the questionnaire (also known as “structural zeros”). While there 
may be alternative reasons for structural zeros, for example, participants 
giving socially desirable responses, we believe this accounts for only a small 
fraction of zero inflation. Interest is in making inference about the correlated 
binary responses for those who are susceptible (i.e., inference about dating 
violence among individuals who were dating). 

There is an extensive literature on zero-inflated Poisson and binomial 
models [Lambert (1992); Hall (2000)] that provide early references, along 
with more recent work on zero-inflated ordinal data [Kelley and Anderson 
(2008)1 and zero-inflated sum score data with randomized responses [Cruyff 
et al. (2008)]. Min and Agresti (2002) reviewed various statistical models 
incorporating zero inflation in both discrete and continuous outcomes for 
cross-sectional data. Diop, Diop and Dupuy (2011) discussed cross-sectional 
binary regression with zero inflation, and proved the model identifiability 
when at least one covariate is continuous. Hall (2000) first considered longi¬ 
tudinal or clustered data with zero-inflated binomial or Poisson outcomes. 
They incorporated a random effect structure to model the within-subject 
correlation and proposed an EM algorithm to estimate the parameters. Hall 
and Zhang (2004) extended the work of Hall (2000) by proposing a gener¬ 
alized estimation equation (GEE) approach to model several zero-inflated 
distributions in a longitudinal setting. Min and Agresti (2005) presented 
a Hurdle model with random effects for repeated measures of zero-inflated 
count data. There has been no work, however, on zero-inflated clustered 
binary data. 

A component of the NEXT Generation Health Study examines the preva¬ 
lence and correlates of dating violence among 2787 tenth-grade students, 
following them over seven years. Dating violence is common among adoles¬ 
cents, may impact adolescent expectations regarding adult intimate relation¬ 
ships [Collins (2003)], and has been found to be associated with increased 
risk of depression and engagement in high-risk behaviors [Ackard, Eisenberg 
and Neumark-Sztainer (2007) and Exner-Cortens, Eckenrode and Rothman 
(2013)]. Thus, dating violence among adolescents merits interest from both 
developmental and public health perspectives [Offenhauer and Buchalter 
( 2011 )]. 

Investigators involved in the NEXT study are primarily interested in iden¬ 
tifying the risk factors associated with dating violence. Haynie et al. (2013) 
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Fig. 1. Distribution of subjects’ responses to five dating violence victimization questions 
and the fitted probabilities using a zero-inflated binomial model (black squares). 


found a relationship between high-risk behaviors (i.e., depressive symptoms, 
alcohol use, smoking and drug use), gender and the prevalence of dating vio¬ 
lence victimization. A total of 10 questions were asked about dating violence. 
Five of the questions were on dating violence victimization: did your partner 
(1) insult you in front of others, (2) swear at you, (3) threaten you, (4) push 
or shove you, or (5) throw anything that could hurt you; the other five were 
similar questions on perpetration: did you (1) insult your partner in front 
of others, (2) swear at your partner, (3) threaten your partner, (4) push or 
shove your partner, or (5) throw anything that could hurt your partner? As 
illustrated in Figure 1, the distribution of the number of “yes” responses is 
clumped at zero. When we fit the frequencies with a zero-inflated binomial 
distribution, the zero-inflation probability is estimated to be about 58%. 
The binomial distribution yields a poor fit to the frequencies for two reasons. 
First, the prevalence of “yes” responses is unequal across different questions; 
second, the responses from the same subject are correlated. But this only 
serves as an intuitive visualization of zero inflation. One can argue that the 
clump of zeros might be due to the high correlation of the binary responses 
within the same subject; and, therefore, we also fit the generalized linear 
mixed model (GLMM) and plot the fitted frequencies in Figure 1. GLMM 
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attempts to fit the spike at 0, and hence tends to overestimate the within- 
subject correlation. In this paper, we hope to explore whether zero inflation 
exists while allowing for the cluster effects. We propose maximum-likelihood 
(ML) and GEE approaches to simultaneously account for the zero inflation 
and clustering in the multiple binary responses. The major difference be¬ 
tween our work and the previous work is that Hall (2000), Hall and Zhang 
(2004), and Min and Agresti (2005) all considered the zero inflation at the 
“observation level,” while in our paper the zero inflation is at the “subject 
level” (meaning that with a structural zero, all the binary responses from 
a subject are zero). Eor our dating violence example, subjects have all zero 
responses because they are not susceptible to the condition (e.g., in a rela¬ 
tionship) . The proposed methods are evaluated and compared in simulation 
studies. We then reexamine the relationships between high-risk behaviors 
and dating violence among teenagers using the proposed analysis strategy 
accounting for zero inflation. 

In Section 2 we present both maximum-likelihood and GEE approaches for 
parameter estimation. Section 3 discusses the identifiability of the proposed 
model and proposes a likelihood ratio test for zero inflation. Simulation 
study results are presented in Section 4. The NEXT dating violence data is 
analyzed in Section 5, and a discussion follows in Section 6. 

2. Method. Let Yj = {Yu ,..., l^j)' be the multivariate binary outcome 
for subject i (i = 1,..., N), and Xj = (Xji,..., Xjj)' be the corresponding 
matrix of covariates. Let Zj be the latent class, so that Yj always takes the 
value of 0 (structural zero) if Zj = 0, and Yj follows a multivariate binary dis¬ 
tribution with density /(Yj;0) if Zj = 1, where 0 is a vector of parameters. 
We suppress the subscript i when there is no confusion. Let p = Pr(Z = 1) 
be the prevalence of the latent class 1. In our example, Zj = 1 indicates that 
subject i is susceptible to the possibility of dating violence (i.e., potential of 
answering the dating violence questions in a positive fashion), while Zj = 0 
indicates that the subject is not susceptible. 

2.1. Maximum-likelihood estimation. If both Y and Z are observed, the 
individual contribution to the full data likelihood is 

L^(Y, Z; 6) = {/(Y = 0)(1 - p)}'-^{/(Y; e)pf. 

The observed likelihood of Y is then given by 

L(Y; 6) = L^{Y, Z = 0; 0) + L^(Y, Z = 1; 0) 

= I{Y = 0){l-p) + f{Y-,e)p. 

Here we assume that the zero-inflation probability p is the same across 
all the subjects in the sample. This could easily be extended to allow p to 
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depend on covariates, for example, with a logistic regression model. We use a 
generalized linear mixed effects model (GLMM) to describe the multivariate 
distribution, /(Y;0): 

where 7 rjj(bj) = Pr(Yij = l|Xjj, bj), bj is the vector of random effects follow¬ 
ing the multivariate normal distribution MVN(0, A), Zij is the design matrix 
of the random effects, and g is the known link function. The parameter vec¬ 
tor 6 consists of the parameter of interest 7 and the nuisance parameters 
in the variance component A. Assume d^j’s are mutually independent given 
Xjj and bj, and let p(bj; A) be the probability density function of bj. Then 
the likelihood for subject i becomes 


L{Y,-e) = I{Yi = 0){l-p) 

J 

+ 


P y< - 7rij(bi))^ U(bi; A) db*. 


0=1 

The integral with respect to the random effects can be approximated by 
Gaussian-Hermite quadrature as 




i-T, 


\p(hi]A)dhi 




Q 


Wq X 7rjj(bj^q) (1 7rjj(bj^q)) 


1-T, 


q=l 


i=i 


where bj^g is the g'th quadrature grid point and Wq is the associated weight 
[Abramowitz and Stegun (1972)]. 

The parameter estimation for p and 9 can be found by maximizing the 
log-likelihood for all N subjects, logL(Yj; 0). The variance estimation 
is calculated from the inverse of the observed information: 


N 

E 


02 


■logL(Y,;0) 


and can be implemented by the optim function in R [R Gore Team (2014)]. 


2.2. Generalized estimating equations (GEE). Likelihood-based inference 
makes full distributional assumptions on Y\Z = 1. When these assumptions 
are correct, the estimator gains efficiency; otherwise, classical inference has 
poor statistical properties. We explore the estimating equations approach 
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[Liang and Zeger (1986)] that only specifies a structure for the conditional 
mean E{Y\Z = l,Xj). Suppose 

(2.1) tif = E{Yi\Z, = l,Xi)=g{X,p), 

where g is the known link function and (3 is the regression coefficients of 
interest. Unconditional on Zj, the “marginal” mean of Yj is given by 

= E{Yi\y.,) = g{XiP) X Pr(Z = 1) + 0 x Pr(Z = 0) 

= pg{XiP). 

The estimating equations can then be written as 

N 

( 2 . 2 ) ^D'iVr\Y,-pLf) = 0, 

i=l 

where Di = and V) is the working covariance matrix for Yj [Liang 

1/9 1 /9 

and Zeger (1986)]. We can decompose V) as A- RiA- with Ai being the 
diagonal matrix of the variance of Yij [which is — Rij)] and Ri being 

the working correlation matrix specified by some nuisance parameter r/. 

If the mean model (2.1) is correct, the estimating equations (2.2) are 
always consistent regardless of the working correlation, and choosing an 
approximately correct working correlation generally leads to improved ef¬ 
ficiency. In the context of zero-inflated regression, we propose two ways 
to specify the working correlation: marginal and conditional specihcation. 
The marginal correlation directly makes assumptions on Ri, which is sim¬ 
ilar to the standard GEE: the marginal independent correlation assumes 
= Ijx j, the J-dimensional identity matrix; the marginal exchangeable 
correlation assumes that Rf^^ = (1 — a)ljxj + aljxj, where Ijx j is the 
J X J square matrix of ones. We refer to these two different approaches as 
GEE-MI and GEE-ME, respectively. 

The conditional correlation exploits the zero-inflated structure and utilizes 
the conditional covariance, = cov(YjjZj = 1), to derive the unconditional 
covariance cov(Yj). A similar idea was first used by Hall and Zhang (2004) 
to derive their GEE estimator for observation-level zero inflation. By the 
law of total covariance, for j ^ j', 

cov(Yj,-, Y,,v) = E{coY{Yij,Yij,\Z)) + cov{E{Y,^\Z),E{Yij,\Z)) 

= ^(Vi^jj'Z) + coY{pfjZ, pfj,Z) 

= V^jj,p + pfjpfj,p{l - p), 

where Uj^j/ is the (j,/) element of IZf, and is the jth element of /if. 
The variance of Yij is given by var(Yjj) = /i^(l — Ri^)- The conditional 
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independence correlation assumes that V-jj, = 0, so the working correlation 


is with the {j,j') element as 


pCI _ 




\lpfjP{'^ - PfjP)pfj'Pi'^ - PffP) 

The conditional exchangeable correlation assumes that 

that is, a correlation of a between any Yij and given Z = 1. Therefore, 
the {j,j') element of the working correlation is 


pCE _ 


- Pfp) + pfjPffPC^ - P) 
^JpfjPi^ - pfjP)pffPi'^ - pffP) 


We refer to these conditional GEE approaches as GEE-CI and GEE-GE, 
respectively. 

Similar to the ordinary GEE, an unstructured working correlation can 
be assumed that allows for distinct correlations for each pair of outcomes. 
With the unstructured GEE, the marginal and conditional specification of 
working correlation are equivalent, that is. 

We refer to this approach as GEE-UN. 

With each of the five forms of working correlation matrices, we could 
solve (2.2) using the Newton-Raphson method to obtain the corresponding 
parameter estimates f3. With the exchangeable or unstructured correlation 
structure, we iteratively update a from its moment estimator and /3 from 
equation (2.2) [Liang and Zeger (1986)]. According to the standard the¬ 
ory of GEE, the variance of the estimated (3 has the usual sandwich form 

21/V 


Ijy^, where 


N 


AN = Y,D[V,Di, 
i=l 
N 

Bn = - /xf )(Y, - iiflVr^D, 


2 = 1 
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2.3. Marginal covariate effect. We note that the regression parameters 
in the GLMM and GEE are not directly comparable as they have differ¬ 
ent interpretations. The former is interpreted as the “subject-specific effect” 
conditional on a subject i, while the latter is the “population-averaged ef¬ 
fect” or “marginal effect” [Zeger, Liang and Albert (1988)]. Thus, GLMM 
and GEE are not compatible for nonidentity link functions. In other words, if 
the GLMM is true, the marginal expectation by integrating out the random 
effects bj may not preserve the linear additive form of the covariates. How¬ 
ever, for binary regression with a probit link and random intercept, GLMM 
and GEE are compatible. We adopt a probit random effects model for both 
the simulations and example analysis. 

Let $ and f be the c.d.f. and p.d.f. of the standard normal distribution. 
Gonsider the generalized linear mixed effects model with a probit link and 
a random intercept only, 


Fr{Yij = = ^( X',-7 + h), 

~iV(0,(Tb). 


By integrating out 6j, the marginal probability of Yij is computed as follows: 





While GLMM estimates Pr(lij = l|Xjj,6j), GEE estimates FviYij = IjXjj). 
The latter is a probit regression model as well, with the regression coeffi¬ 
cients, Y/ \J 1 ■ This allows us to compare the performance of GLMM 

and GEE by comparing the marginal effects of the covariates, which is our 
interest in the dating violence analysis of the NEXT study. 

3. Model identifiability and test for zero inflation. In general, zero- 
inflated models are mixtures of two parametric parts, a point mass at zero 
(equivalently, a binary distribution with p = 0) and a parametric distribution 
for the nonstructural zero part. Typically, zero-inflated models are identi¬ 
fied by observing a larger number of zeros than would be consistent with the 
parametric model. For example, with Poisson or binomial outcomes, one can 
observe excessive proportion of zeros with a histogram. For a single binary 
outcome, zero inflation cannot be distinguished from rare events, unless co¬ 
variate dependence is introduced. When there is a continuous covariate X, 
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zero inflation is identifled because of the linear effect of X on the binary re¬ 
sponse through a known link function. Follmann and Lambert (1991) proved 
a weaker sufficient condition for identiflability when covariates are all cate¬ 
gorical: to identify a two component mixture of logistic regressions with a 
binary response, the covariate vector needs to take at least 7 distinct val¬ 
ues. Kelley and Anderson (2008) also used the same argument to prove the 
identiflability of zero-inflated ordinal regression. Single binary outcome can 
be seen as a special case of our proposed model with J = 1 and cr^ = 0. As 
more information is available with J > 1, our model is also identifled under 
Follmann and Lambert’s condition. 

Diop, Diop and Dupuy (2011) proved the model identiflability for the 
zero-inflated binary regression with at least one continuous covariate. Using 
a similar technique, we can prove our model identiflability. For GEE with a 
probit link, consider {f3',p) and {(3*',p*) to be two parameter vectors that 
yield the same conditional mean E'(L)j|Xjj), that is, 

(3.1) p^X.'(3)=p*^X'p*). 

Equivalently, ^ Suppose the Ith component of X (i.e., xi) is 

continuous, then we can take the partial derivative with respect to xi, which 
yields 

$2(X'/3) 

.3 21 ^ /3r 

^ ’ A ^(X'/3)<^(X'/3*) 

/3r _ pm'P) 

fil p*<p(x'(3*y 

Taking the partial derivative on both sides of (3.2) with respect to xi, and 
with some algebra, it follows that X'/3 = X'/3*, and hence (3 = (3*. Erom 

(3.1) , we further get p = p*■ This proves the identiflability GEE-CI and 
GEE-MI. In GEE-UN, the association parameters are indeed obtained from 
a moment estimator of the correlation between Yij and 1)j'. Since the mean 
model is identifled, the variance and correlation are also identifled. Eor ex¬ 
changeable working correlation, the association parameter is the “average” 
correlation between all Yij and Yiji pairs with j y j ', which is identifiable as 
well. 

We now prove the identiflability of the random effects model (2.1) with 
a probit link. With normally distributed random effects, the mean of Yij 
could be marginalized as <h( ^ variance-covariance 

matrix of random effects b,. We further assume that a continuous covariate 
is contained in X but not in Z. Then the same argument of (3.2) still applies 
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by denoting proves the identifiability of the regression 

coefficients up to a scale. Now it suffices to prove the identifiability of A. 
Denote ajji as the correlation coefficient of Yij and Yiji (j ^ j') given Z = 1. 
Note that from Section 2.2, we have 

cov{Yij,Yij,) = - p). 

Since /3 is identifiable, pfj = d>(Xh/3) is also identifiable. Therefore, if two 
parameter vectors 6 = ( 7 ',^, A)' and 0* = {-y*',p*, A*)' lead to the same 
cov(Yij,Yiji) and EYij, ajji must be the same. Furthermore, the regular 
GLMM is identifiable, suggesting that the correlation structure ajji condi¬ 
tional on Z = 1 is uniquely defined by A. Hence, we prove A = A*, and, 
consequently, the identifiability of the ML estimator is established. 

We also note that when = 0 and no covariates are available, the re¬ 
peated binary counts could be collapsed into a binomial distribution. The 
problem then reduces to the zero-inflated binomial model, which is clearly 
identifiable. In the presence of the random effects, collapsing the binary 
counts leads to an over-dispersed binomial distribution. Hall and Beren- 
haut ( 2002 ) discussed the zero-inflated beta-binomial model, where the over¬ 
dispersion is controlled by a beta distributed random intercept. Our model 
assumes that the over-dispersion comes from a normal distributed random 
intercept. 

Another way to view the proposed model is a mixture of random effect 
distributions. Recall that Y^j follows a Bernoulli distribution with probability 
TTij, given by 

= ^ijl + ^v^i- 

Instead of introducing the latent class Zj, we assume that b, is a mixture of 
normal distribution and a point mass at — 00 : 

^ _ f MVN(0, A), with probability p, 

* I — 00 , with probability 1 —p. 

When bj = — 00 , the probability vTjj is always 0 for j = 1,..., J, so Yj is the 
structural zero. It is easy to show that the likelihood is exactly the same as 
the proposed model. 

In practice, one may wish to test for the existence of zero inflation, which 
can be performed under the parametric model framework. The likelihood 
ratio statistic is given by 


A — 2{li — Iq), 


where li is the maximized log-likelihood for the zero-inflated model, and Iq 
is the maximized log-likelihood for the ordinary GLMM. As the null hypoth¬ 
esis (p = 0) is on the boundary of the parameter space, the asymptotic null 
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distribution of A is a mixture of Xi and point mass at 0, with equal mixture 
probabilities [Self and Liang (1987)]. Theoretically, we could also construct 
a score test statistic similar to the test proposed by van den Broek (1995) 
for zero inflation in a Poisson distribution. However, for our problem, the 
likelihood function involves intractable integrals, making the score and in¬ 
formation matrix both difficult to evaluate. So in our application, we apply 
the likelihood ratio test. 

4. Simulation studies. Motivated by the NEXT study, the data genera¬ 
tion for the simulation studies mimics the real example. To evaluate the sta¬ 
tistical properties of the above methods, simulation studies of the true model 
and a misspecified model were run with two different levels of within-cluster 
correlation. A sample size of A^ = 2000 with a cluster size of J = 5 ques¬ 
tions is considered. The simulations were repeated 5000 times to compare 
the performance of the naive estimator (GLMM, where the zero-inflation 
is ignored), the maximum-likelihood (ML) estimator and the five GEE es¬ 
timators (GEE-MI, GEE-GI, GEE-ME, GEE-CE and GEE-UN). We cal¬ 
culated the average (Mean) and standard deviation (SD) of the estimated 
parameters, average of the estimated standard errors (SE) and 95% GI cov¬ 
erage rates (GOVER) based on the Wald intervals to evaluate the robustness 
and efficiency of the GEE and the maximum-likelihood approaches. Twenty 
Gaussian-Hermite quadrature points were used for computing the GLMM 
and ML estimators. We also tried 10 and 40 quadrature points as well as 
the adaptive quadrature with 250 simulated data sets. In our simulations, 
the results are very similar for differing number of quadrature points. Our 
experience for generalized linear mixed models with the logit link function 
is that Gaussian quadrature works very well, and in most situations AGQ is 
not needed. In terms of numerical efficiency, we found that the computation 
time for AGQ is about 10-20 times longer than the fixed quadrature. 

The estimated parameters for ML and GLMM methods were marginal¬ 
ized, as we described in Section 2.3. In the following sections, we evaluate the 
performance of the maximum likelihood and GEE under a correctly specified 
and a misspecihed model. Additional simulation results are reported in the 
supplementary material [Fulton et al. (2015)], including (a) the performance 
of the proposed model with a smaller sample size (A^ = 500); (b) sensitiv¬ 
ity of assuming a constant zero-inflation probability when the probability is 
affected by covariates; (c) performance of zero-inflated beta-binomial model. 

4.1. Simulation one: Correctly specified model. We generated a contin¬ 
uous subject-level covariate Xi from a standard normal distribution and 
categorical covariate Qij = 1,..., 5 to denote the questions for each subject. 
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The zero-inflation indicator Zi was generated from Bernoulli(p) with p = 0.7. 
The outcome Yij was generated from a probit random effects model: 

Qij) Yi = 1) 

= <h{7o -|- 7iXj -|- 'y2l{Qij = 2) + 73/(Qij = 3) 

+ = 4) -|- ')^I{Qij = 5) -|- hi}, 

where /(•) is the indicator function and hi is the random intercept fol¬ 
lowing a normal distribution A^(0,cr^). We fixed the regression parameters 
7 = ( 70 , 71 , 72 , 73 , 74 , 75 )' = (0,1,-0.5,-0.4,0.2,0.4)'. The variance compo¬ 
nent was taken to be 0.5^ and 1.5^, respectively, to reflect weak (Pear¬ 
son correlation of about 0.1) and strong (Pearson correlation of about 0.45) 
within-cluster correlations. The simulation results are shown in Tables 1 and 
2, where the true regression parameters are the marginal covariate effects 
given by/3 = ^^. 

Both the ML and the five GEE methods perform reasonably well, in terms 
of small bias and good Cl coverage rate. GLMM is seriously biased with 
poor Cl coverage. It can be seen that the ML method is the most efficient, 
as it makes use of the full distributional assumption of the observed data. 
On the contrary, GEE only relies on the first moments of the outcome. In 
estimating p, the zero-inflated probability, the SEs of the GEE approaches 


Table 1 

The mean of 5000 simulations of estimated eoefficients (Mean), empirical standard 
deviation (SD), average standard error (SE) and the 95% interval coverage rate 
(COVER) for the maximum-likelihood, naive and GEE methods of the correctly specified 

model with at — 0.5, N — 2000 



Parameter* 

True 

Mean 

SD 

SE 

COVER 

ML 

Pr(Z= 1) 

0.700 

0.700 

0.014 

0.014 

0.949 


(Jb 

0.500 

0.499 

0.040 

0.039 

0.953 


Po 

0.000 

0.000 

0.039 

0.040 

0.959 


Pi 

0.894 

0.895 

0.027 

0.027 

0.949 


P 2 

-0.447 

-0.448 

0.051 

0.051 

0.951 


P3 

-0.358 

-0.358 

0.050 

0.051 

0.956 


pA 

0.179 

0.179 

0.051 

0.050 

0.949 


/^S 

0.358 

0.358 

0.050 

0.051 

0.948 

GLMM 

(Jb 

0.500 

1.352 

0.047 

0.046 

0.000 


Po 

0.000 

-0.444 

0.029 

0.030 

0.000 


Pi 

0.894 

0.570 

0.028 

0.025 

0.000 


P 2 

-0.447 

-0.301 

0.034 

0.034 

0.013 


P3 

-0.358 

-0.239 

0.033 

0.034 

0.060 


pA 

0.179 

0.114 

0.032 

0.032 

0.489 



0.358 

0.224 

0.031 

0.032 

0.015 
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Table 1 
( Continued) 



Parameter* 

True 

Mean 

SD 

SE 

COVER 

GEE-MI 

Pr(Z= 1) 

0.700 

0.704 

0.041 

0.040 

0.951 


Po 

0.000 

-0.002 

0.096 

0.096 

0.951 


Pi 

0.894 

0.897 

0.061 

0.061 

0.947 


P2 

-0.447 

-0.448 

0.060 

0.060 

0.950 


Pz 

-0.358 

-0.358 

0.056 

0.057 

0.952 


Pi 

0.179 

0.179 

0.055 

0.054 

0.952 


P5 

0.358 

0.358 

0.059 

0.059 

0.946 

GEE-CI 

Pr(Z= 1) 

0.700 

0.701 

0.029 

0.029 

0.948 


Po 

0.000 

0.001 

0.069 

0.070 

0.955 


Pi 

0.894 

0.897 

0.044 

0.043 

0.953 


P2 

-0.447 

-0.449 

0.055 

0.056 

0.953 


P3 

-0.358 

-0.359 

0.053 

0.054 

0.958 


Pi 

0.179 

0.179 

0.052 

0.052 

0.951 


PZ 

0.358 

0.360 

0.056 

0.056 

0.946 

GEE-ME 

Pr(Z= 1) 

0.700 

0.701 

0.032 

0.032 

0.953 


Po 

0.000 

0.001 

0.077 

0.078 

0.953 


Pi 

0.894 

0.898 

0.049 

0.049 

0.949 


P2 

-0.447 

-0.449 

0.058 

0.058 

0.955 


Pz 

-0.358 

-0.359 

0.055 

0.056 

0.956 


Pi 

0.179 

0.180 

0.054 

0.054 

0.952 


Pz 

0.358 

0.359 

0.057 

0.058 

0.950 

GEE-CE 

Pr(Z= 1) 

0.700 

0.701 

0.029 

0.029 

0.950 


Po 

0.000 

0.001 

0.069 

0.070 

0.955 


Pi 

0.894 

0.897 

0.044 

0.043 

0.953 


P2 

-0.447 

-0.449 

0.055 

0.056 

0.954 


Pz 

-0.358 

-0.359 

0.052 

0.054 

0.958 


Pi 

0.179 

0.179 

0.052 

0.052 

0.951 


Pz 

0.358 

0.360 

0.056 

0.056 

0.946 

GEE-UN 

Pr(Z= 1) 

0.700 

0.702 

0.036 

0.036 

0.952 


Po 

0.000 

0.000 

0.085 

0.085 

0.952 


Pi 

0.894 

0.897 

0.053 

0.053 

0.948 


P2 

-0.447 

-0.448 

0.058 

0.059 

0.954 


Pz 

-0.358 

-0.358 

0.055 

0.056 

0.954 


Pi 

0.179 

0.179 

0.055 

0.054 

0.952 


Pz 

0.358 

0.359 

0.058 

0.058 

0.948 

*P{Yij = 

Qij^ Zi = 1) 

— *3&{/3o + PiXij + (321 {Qij 

— 2) + — 3) + (34:1 {Qij — 


4) + PbliQij — 5)}. 


are more than twice as large as the SE of the ML method. The SEs for other 
parameters are also significantly smaller for the ML method. 

Of the five GEE methods, we found that GEE-CE is the most efficient 
with the smallest SE, while GEE-MI is the least efficient. By exploiting the 
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Table 2 

The mean of 5000 simulations of estimated coefficients (Mean), empirical standard 
deviation (SD), average standard error (SE) and the 95% interval coverage rate 
(COVER) for the maximum-likelihood, naive and GEE methods of the correctly specified 

model with as = 1.5, N = 2000 



Parameter* 

True 

Mean 

SD 

SE 

COVER 

ML 

Pr(Z= 1) 

0.700 

0.701 

0.024 

0.024 

0.951 


o-b 

1.500 

1.503 

0.088 

0.089 

0.955 


/3o 

0.000 

-0.001 

0.053 

0.053 

0.950 


Pi 

0.555 

0.555 

0.033 

0.033 

0.949 


P2 

-0.277 

-0.278 

0.037 

0.038 

0.952 


P3 

-0.222 

-0.222 

0.037 

0.037 

0.949 


Pi 

0.111 

0.111 

0.036 

0.036 

0.954 


P5 

0.222 

0.222 

0.037 

0.037 

0.951 

GLMM 

O-b 

1.500 

2.248 

0.071 

0.073 

0.000 


Po 

0.000 

-0.419 

0.032 

0.029 

0.000 


Pi 

0.555 

0.384 

0.029 

0.026 

0.000 


P2 

-0.277 

-0.204 

0.027 

0.027 

0.230 


P3 

-0.222 

-0.163 

0.027 

0.027 

0.387 


Pi 

0.111 

0.079 

0.025 

0.026 

0.763 


P5 

0.222 

0.155 

0.025 

0.026 

0.271 

GEE-MI 

Pr(Z= 1) 

0.700 

0.712 

0.083 

0.088 

0.949 


Po 

0.000 

-0.002 

0.161 

0.170 

0.966 


Pi 

0.555 

0.560 

0.072 

0.075 

0.962 


P2 

-0.277 

-0.279 

0.050 

0.050 

0.949 


P3 

-0.222 

-0.223 

0.046 

0.046 

0.950 


Pi 

0.111 

0.111 

0.040 

0.040 

0.954 


P5 

0.222 

0.223 

0.047 

0.049 

0.953 

GEE-GI 

Pr(Z= 1) 

0.700 

0.709 

0.064 

0.064 

0.954 


Po 

0.000 

-0.006 

0.121 

0.123 

0.969 


Pi 

0.555 

0.556 

0.056 

0.057 

0.958 


P2 

-0.277 

-0.278 

0.045 

0.045 

0.951 


P3 

-0.222 

-0.222 

0.042 

0.042 

0.946 


Pi 

0.111 

0.111 

0.038 

0.039 

0.952 


P5 

0.222 

0.223 

0.044 

0.045 

0.951 

GEE-ME 

Pr(Z= 1) 

0.700 

0.706 

0.058 

0.057 

0.947 


Po 

0.000 

-0.001 

0.115 

0.114 

0.954 


Pi 

0.555 

0.557 

0.053 

0.053 

0.952 


P2 

-0.277 

-0.279 

0.045 

0.044 

0.948 


P3 

-0.222 

-0.223 

0.042 

0.042 

0.947 


Pi 

0.111 

0.111 

0.039 

0.039 

0.955 


P5 

0.222 

0.223 

0.043 

0.044 

0.953 
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Table 2 
( Continued) 


Parameter* 

True 

Mean 

SD 

SE 

COVER 

GEE-CE Pr(Z = 1) 

0.700 

0.707 

0.056 

0.056 

0.951 

/3o 

0.000 

-0.003 

0.111 

0.111 

0.956 

Pi 

0.555 

0.556 

0.052 

0.051 

0.951 

P2 

-0.277 

-0.278 

0.043 

0.043 

0.952 

P3 

-0.222 

-0.222 

0.041 

0.041 

0.950 

P4 

0.111 

0.111 

0.038 

0.038 

0.955 

05 

0.222 

0.223 

0.043 

0.043 

0.954 

GEE-UN Pr(^ = 1) 

0.700 

0.709 

0.068 

0.068 

0.950 

Po 

0.000 

-0.003 

0.132 

0.134 

0.959 

Pi 

0.555 

0.558 

0.060 

0.060 

0.952 

P2 

-0.277 

-0.278 

0.046 

0.046 

0.948 

P3 

-0.222 

-0.223 

0.044 

0.043 

0.948 

P4 

0.111 

0.111 

0.039 

0.039 

0.951 

P3 

0.222 

0.223 

0.045 

0.046 

0.952 

*P{Yij = = 1) 

= ${Po + PiXij 

+ 02^{Qij 

= 2) + PsIiQ 

“1“ 04:^{Qij — 


4) + PbliQij ~ 5)}. 


correlation structure induced by the zero-inflation process, the conditional 
independence and exchangeable working correlation both gain a substantial 
amount of efficiency, compared to their marginal counterparts. This result 
is consistent with the simulation results in Hall and Zhang (2004). The SEs 
for GEE-CE and GEE-CI are quite close, implying that adding working de¬ 
pendence to the outcome given Zi = l would not help much as long as the 
dependence due to zero inflation is accounted for. We did observe a bigger 
improvement of GEE-GE versus GEE-CI for the strong correlation case. But 
the improvement of GEE-CI versus GEE-MI is even larger. Therefore, we 
recommend that it is more important to make use of the zero-inflation struc¬ 
ture in the GEE estimators. Although GEE-UN has the most flexible form 
of working correlation, it is not as efficient as GEE-CI or GEE-CE, prob¬ 
ably due to estimating a larger number of nuisance parameters. We found 
that GEEs may occasionally not have a solution or have a boundary solu¬ 
tion {p = I) in about 1-2% of the simulations with ab = 1.5. Our experience 
is that nonconvergence or boundary solutions occur more often when the 
covariate effects are weaker, the within-cluster correlation is stronger, the 
true zero-inflation probability is closer to 1, or the model is more severely 
misspecified. 

4.2. Simulation two: Model misspecification. We consider a misspecified 
model where only the hrst three questions are correlated and the last two 
are independent. The data generation for Yij with j = 1,2,3 was the same 
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as in Section 3.1, but Yij for j = 4,5 was generated as follows; 

P(Xij = Qij^bi, Zi = 1 ) 

7o + liXij + l4:I{Qij = 4) + 'y 5 l{Qij = 5) 

The random intercept bi was not added to the last two questions, but a factor 
of was divided to the coefficients to keep the marginalized regression 

coefficients the same. In this case, the ML estimator is from a misspecified 
model since the random intercept model imposes correlation among all the 
questions. For GEE, only the working correlation is misspecified, while the 
first moment of Yij is still correct. The simulation results are presented in 
Tables 3 and 4. 

With cjfe = 0.5, the ML approach is almost unbiased for estimating p as 
well as the regression coefficients. When cTf, increases to 1.5, the ML esti- 



Table 3 

The mean of 5000 simulations of estimated eoefficients (Mean), empirical standard 
deviation (SD), average standard error (SE) and the 95% interval coverage rate 
(COVER) for the maximum-likelihood, naive and GEE methods of the misspecified model 

with (Jb = 0.5, N = 2000 



Parameter* 

True 

Mean 

SD 

SE 

COVER 

ML 

Vr{Z = l) 

0.700 

0.706 

0.013 

0.013 

0.925 


o-b 

0.500 

0.273 

0.047 

0.047 

0.000 


do 

0.000 

-0.008 

0.038 

0.039 

0.953 


di 

0.894 

0.899 

0.024 

0.024 

0.948 


02 

-0.447 

-0.446 

0.051 

0.053 

0.963 


da 

-0.358 

-0.357 

0.049 

0.053 

0.964 


d4 

0.179 

0.177 

0.053 

0.052 

0.949 


ds 

0.358 

0.355 

0.053 

0.053 

0.949 

GLMM 


0.500 

1.176 

0.042 

0.040 

0.000 


do 

0.000 

-0.441 

0.029 

0.030 

0.000 


di 

0.894 

0.570 

0.028 

0.024 

0.000 


02 

-0.447 

-0.303 

0.035 

0.036 

0.019 


da 

-0.358 

-0.241 

0.034 

0.035 

0.080 


d4 

0.179 

0.114 

0.034 

0.034 

0.519 


ds 

0.358 

0.223 

0.033 

0.034 

0.021 

GEE-MI 

Pr(^ = l) 

0.700 

0.704 

0.041 

0.040 

0.951 


do 

0.000 

-0.002 

0.095 

0.095 

0.951 


di 

0.894 

0.896 

0.059 

0.060 

0.949 


02 

-0.447 

-0.447 

0.059 

0.060 

0.953 


da 

-0.358 

-0.357 

0.056 

0.057 

0.953 


d4 

0.179 

0.178 

0.057 

0.057 

0.950 


ds 

0.358 

0.358 

0.061 

0.062 

0.948 
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Table 3 
( Continued) 


Parameter* 

True 

Mean 

SD 

SE 

COVER 

GEE-CI Pr(Z=l) 

0.700 

0.701 

0.030 

0.030 

0.950 

do 

0.000 

0.000 

0.070 

0.071 

0.955 

di 

0.894 

0.897 

0.044 

0.044 

0.951 

P2 

-0.447 

-0.449 

0.055 

0.056 

0.956 

da 

-0.358 

-0.358 

0.053 

0.054 

0.957 

d4 

0.179 

0.179 

0.055 

0.055 

0.949 

ds 

0.358 

0.360 

0.059 

0.059 

0.951 

GEE-ME Pr(Z = 1) 

0.700 

0.702 

0.034 

0.033 

0.951 

do 

0.000 

0.000 

0.079 

0.080 

0.952 

di 

0.894 

0.897 

0.049 

0.050 

0.949 

P2 

-0.447 

-0.448 

0.058 

0.058 

0.955 

da 

-0.358 

-0.358 

0.055 

0.056 

0.954 

d4 

0.179 

0.179 

0.057 

0.057 

0.952 

da 

0.358 

0.359 

0.061 

0.061 

0.951 

GEE-CE Pr(Z=l) 

0.700 

0.701 

0.030 

0.030 

0.951 

do 

0.000 

0.000 

0.070 

0.071 

0.955 

di 

0.894 

0.897 

0.044 

0.044 

0.950 

02 

-0.447 

-0.448 

0.055 

0.056 

0.955 

da 

-0.358 

-0.358 

0.053 

0.054 

0.956 

d4 

0.179 

0.179 

0.055 

0.055 

0.950 

da 

0.358 

0.359 

0.059 

0.059 

0.952 

GEE-UN Pr(Z = 1) 

0.700 

0.703 

0.037 

0.036 

0.952 

do 

0.000 

-0.001 

0.086 

0.087 

0.951 

di 

0.894 

0.897 

0.053 

0.054 

0.948 

02 

-0.447 

-0.448 

0.058 

0.059 

0.953 

da 

-0.358 

-0.358 

0.055 

0.056 

0.954 

d4 

0.179 

0.179 

0.057 

0.057 

0.950 

da 

0.358 

0.359 

0.061 

0.062 

0.950 

*P{Yij = = 1) 

= ${do -t PiXij 

+ (321 {Qij 

= 2) + 

^3 — 3) + 04,1 {Qij — 


4) + PbliQij ~ 5)}. 


mator becomes slightly biased with poor Cl coverage, especially for p and 
/3o- The estimation of other parameters appears to be robust to the model 
misspecihcation, except that the SEs for /32 and /^s overestimate the true 
variability. On the other hand, the hve GEE methods all perform quite well, 
in terms of little bias and close-to-nominal coverage rates. Similar to the pre¬ 
vious simulation study, we observed that about 1% of the GEE simulations 
did not converge for at = 1.5. Although the maximum-likelihood approach 
is biased, its standard error is much smaller than the GEE approaches. Eor 
example, the ML estimator for p in Table 4 has a SE only a quarter as large 
as that of the GEE-GI and GEE-CE estimators. As a result of the variance- 
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Table 4 

The mean of 5000 simulations of estimated coefficients (Mean), empirical standard 
deviation (SD), average standard error (SE) and the 95% interval coverage rate 
(COVER) for the maximum-likelihood, naive and GEE methods of the misspecified model 

with Oh = 1.5, N = 2000 



Parameter* 

True 

Mean 

SD 

SE 

COVER 

ML 

Pr(Z= 1) 

0.700 

0.730 

0.015 

0.015 

0.472 


CTf. 

1.500 

0.624 

0.037 

0.038 

0.000 


/3o 

0.000 

-0.046 

0.039 

0.039 

0.776 


fii 

0.555 

0.561 

0.024 

0.024 

0.941 


P2 

-0.277 

-0.273 

0.036 

0.046 

0.986 


P3 

-0.222 

-0.218 

0.036 

0.045 

0.987 


134 

0.111 

0.103 

0.047 

0.044 

0.933 


f3s 

0.222 

0.208 

0.046 

0.045 

0.934 

GLMM 

o-b 

1.500 

1.201 

0.038 

0.039 

0.000 


/3o 

0.000 

-0.416 

0.029 

0.030 

0.000 


Pi 

0.555 

0.378 

0.024 

0.022 

0.000 


P2 

-0.277 

-0.206 

0.027 

0.034 

0.425 


P3 

-0.222 

-0.164 

0.027 

0.034 

0.613 


P4 

0.111 

0.079 

0.034 

0.033 

0.816 



0.222 

0.153 

0.033 

0.032 

0.432 

GEE-MI 

Pr(Z= 1) 

0.700 

0.712 

0.080 

0.082 

0.948 


Po 

0.000 

-0.005 

0.153 

0.159 

0.963 


Pi 

0.555 

0.558 

0.066 

0.067 

0.954 


P2 

-0.277 

-0.278 

0.048 

0.049 

0.946 


P3 

-0.222 

-0.222 

0.045 

0.045 

0.949 


P4 

0.111 

0.111 

0.054 

0.054 

0.951 


P5 

0.222 

0.223 

0.058 

0.06 

0.954 

GEE-GI 

Pr(Z= 1) 

0.700 

0.707 

0.066 

0.066 

0.944 


Po 

0.000 

-0.001 

0.129 

0.130 

0.966 


Pi 

0.555 

0.558 

0.058 

0.058 

0.960 


P2 

-0.277 

-0.279 

0.046 

0.046 

0.948 


P3 

-0.222 

-0.223 

0.043 

0.043 

0.950 


P4 

0.111 

0.112 

0.053 

0.052 

0.950 


P5 

0.222 

0.224 

0.057 

0.058 

0.951 

GEE-ME 

Pr(Z= 1) 

0.700 

0.709 

0.069 

0.068 

0.948 


Po 

0.000 

-0.004 

0.132 

0.133 

0.951 


Pi 

0.555 

0.557 

0.060 

0.060 

0.952 


P2 

-0.277 

-0.278 

0.046 

0.046 

0.946 


P3 

-0.222 

-0.222 

0.043 

0.043 

0.946 


P4 

0.111 

0.112 

0.054 

0.054 

0.950 


P6 

0.222 

0.224 

0.058 

0.059 

0.953 
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Table 4 
( Continued) 


Parameter* 

True 

Mean 

SD 

SE 

COVER 

GEE-CE Pr(Z = 1) 

0.700 

0.709 

0.067 

0.066 

0.946 

/3o 

0.000 

-0.004 

0.129 

0.130 

0.959 


0.555 

0.556 

0.058 

0.058 

0.954 

P2 

-0.277 

-0.279 

0.045 

0.046 

0.948 

da 

-0.222 

-0.223 

0.043 

0.043 

0.948 

d4 

0.111 

0.111 

0.053 

0.052 

0.948 


0.222 

0.224 

0.057 

0.058 

0.951 

GEE-UN Pr(^ = 1) 

0.700 

0.711 

0.073 

0.073 

0.951 

do 

0.000 

-0.005 

0.140 

0.142 

0.954 

di 

0.555 

0.557 

0.061 

0.061 

0.952 

P2 

-0.277 

-0.278 

0.047 

0.047 

0.947 

da 

-0.222 

-0.222 

0.044 

0.044 

0.948 

d4 

0.111 

0.111 

0.053 

0.053 

0.950 

d5 

0.222 

0.223 

0.058 

0.059 

0.953 

*P{Yij = = 1) 

= ${Po + PiXij 

+ 02^{Qij 

= 2) + PsIiQ 

“1“ 04:^{Qij — 


4) + PbliQij ~ 5)}. 


bias trade-off, the mean squared error for the ML estimator is still smaller 
than GEE. If the interest is in estimation, one can still argue that the ML 
performs better; but if the interest is in hypothesis testing, GEE methods 
are preferred, as they are more robust and preserve the correct Type I error 
rate. 

From the above two sets of simulation studies, we would generally recom¬ 
mend the ML estimator in practice because of its high efficiency. The cor¬ 
relation structure of the outcome is critical in identifying the zero-inflation 
process. Therefore, a full parametric assumption for the correlation can lead 
to good efficiency in the estimation. However, if this parametric assumption 
does not hold, the ML estimator could have poor GI coverage rates. In order 
to perform hypothesis testing, we would prefer the GEE approaches, which 
only rely on the correct mean model and are not sensitive to the working 
correlation assumption. Among the hve GEE approaches, the GEE-CI and 
GEE-GE are the most favorable, because they are more efficient by exploit¬ 
ing the dependence structure induced by the zero-inflation process. In prac¬ 
tice, the GEE-GI and GEE-GE estimators may be computed in conjunction 
with the ML estimator as a sensitivity analysis. 

5. Dating violence data example. In this section we fit our proposed ML 
and GEE models to the dating violence example, together with the naive 
GLMM model. A total of 2787 students were enrolled in the study, among 
which 664 left all the dating violence questions blank. These 664 subjects 
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are either not in a relationship, and thus skipped these questions, or they 
did not respond at all to the whole survey. In the remaining 2123 subjects, 
39 were excluded because they only answered part of the dating violence 
questions, and 61 were excluded because they have missing data in the co¬ 
variates. The final analysis sample was N = 2023. The clustered outcomes 
of interest (Tjj) are the ten questions of dating violence, including five vic¬ 
timization and five perpetration questions. We can see from Figure 1 that 
the frequency histogram of “yes” responses shows a huge spike at 0. It seems 
likely that some students who answered all the questions with “no” were not 
in a relationship, that is, a zero inflation of the outcome. Define the latent 
variable Zi = 1 if the subject is in a relationship and 0 otherwise. We in¬ 
cluded gender (GENDER,), depressive symptoms (DS,), family relationship 
(FRj) and family influence (E/,) as the predictors of Yij given Z, = 1. The 
DS score comes from the questionnaire of depressive symptoms and is on the 
continuous scale ranging from 1 to 5, with the larger score indicating worse 
depressive symptoms. The FR (ranging from 0 to 10) measures the partic¬ 
ipant’s satisfaction with the relationship in his/her family, with 10 being a 
very good relationship. The FI score (ranging from 1 to 7) is the family in¬ 
fluence on the participants not verbally or physically abusing their romantic 
partner, with a higher score being greater influence. We adjust for question 
number as a factor and question type (victimization vs. perpetration), in 
order to account for different prevalence of yes responses. The interactions 
between question type and other covariates (GENDER,, DSj, FR, and FI,) 
are also included. The summary statistics of these variables are described in 
Table 5. 

Denote X,j to be the design matrix including all the covariates and in¬ 
teraction terms mentioned above. We fit the probit random effects model 

(5.1) P(Xij — Zi = l) = <k(XE7 -|- bi), 

where 6, ~N(0,cr^) is the random intercept. This model has the same 
marginal mean as the marginal probit regression model: 

(5.2) PiYi, = l\Xi„Zi = 1) = ‘k(XF/3) 

as /3 = ^ for /c = 0,..., 6. For comparative purposes, we report the 

marginal regression coefficients /3 for all the analyses. 

The results of the ML, GLMM and GEE estimations are listed in Table 6. 
From the ML estimation, we can see that all four subject-level covariates 
are significant: the probability of dating violence perpetration was higher for 
females, those who are more depressed, those who have a worse relationship 
in their family, and those who are less influenced by their guardians. The 
interaction terms between question type and gender, and question type and 
family influence were both significant, suggesting (a) boys are more likely to 
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Table 5 

Summary statistics of the dating violence example. Percentage is reported for the 
categorical variables and mean (standard deviation) is reported for the continuous 

variables 


Variable 

Summary statistics (n = 2023) 

Gender: female 

56.6% 

DS score 

2.0 (1.0) 

Family relationship 

7.4 (2.3) 

Family influence 

5.7 (1.8) 

Question IV—Insult you 

18.5% 

Question IP—Insult your boyfriend/girlfriend 

16.8% 

Question 2V—Swear at you 

31.3% 

Question 2P—Swear at your boyfriend/girlfriend 

26.1% 

Question 3V—Threaten you 

7.2% 

Question 3P—Threaten your boyfriend/girlfriend 

5.6% 

Question 4V—Push you 

13.5% 

Question 4P—Push your boyfriend/girlfriend 

9.9% 

Question 5V—Throw object at you 

4.5% 

Question 5P—Throw object at your boyfriend/girlfriend 

3.8% 


be the victims of dating violence, and (b) the family influence has a slightly 
higher impact on dating violence perpetration than victimization. The find¬ 
ing regarding greater male dating violence victimization in this age is in 
line with previous studies [Foshee (1996); Archer (2000)]. The impacts of 
depression score and family relationship are similar regardless of question 
type. The directions of association are expected and consistent with some 
of the hndings in Haynie et al. (2013). The zero-inflation probability is es¬ 
timated to be 0.571, that is, we expect about 43% of the sample (about 
860 subjects) to be structural zeros. We suspect that a majority of them 
were not in a relationship, but answered all the dating violence questions 
with “no.” However, there may be alternative reasons for the zero inflation, 
for example, some kids may give socially desirable answers in the survey 
and hence underreport dating violence. However, we believe that this only 
accounts for a small fraction of the structural zeros. The likelihood ratio 
test statistic for zero inflation is A = 65.2 (p-value < 0.001). The parameter 
estimations by GEE are generally close to ML, but the standard errors are 
larger. The naive GLMM method estimated smaller covariate effects, which 
could be biased due to ignoring the zero-inflated nature of the data. 

As pointed out by a referee, the zero-inflation problem could be avoided by 
including a filter question of asking whether the subject had a relationship or 
not. The filter question was not included because the study investigators felt 
that it was an unreliable question to ask. Relationships between teenagers 
today cannot easily be characterized, and the investigators felt that explicitly 





to 

to 


Table 6 

Parameter estimation and standard errors using the ML, GLMM and GEE methods for the dating violence victimization example 


Parameter 

ML 

GLMM 

GEE-MI 

GEE-CI 

GEE-ME 

GEE-CE 

GEE-UN 

(Intercept) 

-0.192 (0.176) 

-0.746 (0.177) 

-0.055 (0.254) 

-0.159 (0.225) 

-0.497 (0.222) 

-0.341 (0.221) 

-0.201 (0.232) 

Question 2 

0.528 (0.037) 

0.374 (0.024) 

0.578 (0.071) 

0.508 (0.057) 

0.475 (0.058) 

0.488 (0.057) 

0.529 (0.061) 

Question 3 

-0.754 (0.043) 

-0.577 (0.031) 

-0.814 (0.069) 

-0.750 (0.061) 

-0.716 (0.063) 

-0.729 (0.061) 

-0.768 (0.063) 

Question 4 

-0.332 (0.035) 

-0.250 (0.026) 

-0.360 (0.053) 

-0.335 (0.048) 

-0.315 (0.047) 

-0.324 (0.047) 

-0.339 (0.049) 

Question 5 

-1.003 (0.050) 

-0.773 (0.036) 

-1.069 (0.083) 

-0.996 (0.073) 

-0.949 (0.077) 

-0.969 (0.074) 

-1.013 (0.075) 

Question type 

0.222 (0.122) 

0.158 (0.089) 

0.264 (0.173) 

0.234 (0.150) 

0.215 (0.146) 

0.223 (0.145) 

0.215 (0.155) 

DS score 

0.192 (0.036) 

0.178 (0.035) 

0.250 (0.043) 

0.220 (0.038) 

0.226 (0.037) 

0.219 (0.037) 

0.230 (0.039) 

Gender 

0.132 (0.067) 

0.153 (0.054) 

0.171 (0.076) 

0.118 (0.068) 

0.186 (0.067) 

0.160 (0.066) 

0.164 (0.070) 

Family relationship 

-0.037 (0.014) 

-0.035 (0.012) 

-0.042 (0.019) 

-0.039 (0.017) 

-0.041 (0.015) 

-0.041 (0.016) 

-0.041 (0.017) 

Family influence 

-0.114 (0.017) 

-0.087 (0.015) 

-0.137 (0.021) 

-0.135 (0.018) 

-0.095 (0.017) 

-0.112 (0.017) 

-0.123 (0.018) 

Question type x DS score 

0.037 (0.025) 

0.026 (0.018) 

0.034 (0.032) 

0.037 (0.029) 

0.028 (0.026) 

0.034 (0.028) 

0.035 (0.029) 

Question type x gender 

-0.409 (0.051) 

-0.300 (0.037) 

-0.446 (0.066) 

-0.392 (0.057) 

-0.373 (0.056) 

-0.379 (0.056) 

-0.399 (0.059) 

Question type x family 

-0.003 (0.010) 

-0.002 (0.008) 

-0.009 (0.014) 

-0.007 (0.012) 

-0.006 (0.012) 

-0.006 (0.012) 

-0.005 (0.013) 

relationship 








Question type x family 

0.025 (0.012) 

0.019 (0.009) 

0.031 (0.015) 

0.025 (0.013) 

0.026 (0.013) 

0.024 (0.013) 

0.028 (0.013) 

influence 








Pr(Z= 1) 

0.571 (0.031) 

1 

0.523 (0.062) 

0.626 (0.075) 

0.690 (0.104) 

0.663 (0.089) 

0.587 (0.070) 

Ob 

1.033 (0.065) 

1.627 (0.055) 

- 

- 

- 

- 

- 
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asking this question may limit important responses about violence [Short 
et al. (2013)]. There are other cases where the susceptible population cannot 
be ascertained accurately. For example, in drug abuse studies, specifically 
asking whether individuals are drug abusers might not be a question that 
would result in a reliable response. But we may be interested in whether 
the abusers seek particular types of treatment. Additionally, this approach 
may be relevant to questions regarding immigration status, where there may 
be legal implications (perceived or real) in answering, and in questions on 
mental health status, where the respondent may have a reduced ability to 
accurately report their status. 

6. Discussion. In public health research, excessive zero responses may 
occur if the population which is susceptible to respond is not carefully 
screened or is unknown. The resulting zero inflation may have an effect 
on the results obtained by conventional methods of analysis. In the NEXT 
Generation Health Study, investigators were interested in identifying predic¬ 
tors of dating violence in teenagers. Examining these regression relationships 
are of interest for those individuals who are in a relationship (i.e., the sus¬ 
ceptible condition). Many more individuals completed this study component 
than investigators thought would be in a relationship at this age. This led 
to what appears to be zero-inflated clustered binary data. We developed 
both ML and GEE approaches for analyzing such data. Through simula¬ 
tions and analysis of the real data example, we found that the ML approach 
is substantially more efficient than the GEE approaches. However, under 
moderate model misspecification, the ML approach may result in biased in¬ 
ference. It is recommended that, as a sensitivity analysis, both ML and GEE 
approaches be applied in applications. 

In the GEE approach, we treat the regression parameters and nuisance 
working correlation as orthogonal, that is, a GEEl approach with the param¬ 
eters in the working correlation estimated by a moment estimator. Potential 
efficiency gain could be achieved using an improved version of GEEl [Pren¬ 
tice (1988)] or GEE2 [Prentice and Zhao (1991); Liang, Zeger and Qaqish 
(1992)], by establishing another set of estimating equations on the second 
moment. However, GEE2 requires a correct variance structure with working 
third and fourth moment model, which is hard to verify with the presence 
of zero inflation, and results in bias under second, third and fourth moment 
misspecification. The previous work of Hall and Zhang (2004) adopted Pren¬ 
tice and Zhao’s GEE2 that maintains the parameter orthogonality in their 
second moment estimating equations, and they argued that only making a 
first moment assumption may lead to parameter nonidentifiability. This is 
true in their case, where the zero-inflation probability is at the observation 
level, so pij and pfj might be confounded in . However, in the case of 
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subject-level zero inflation, we have shown the model identifiability of the 
GEEl estimators. 

In principle, zero-inflated models cannot be identified nonparametrically; 
parametric assumptions for the nonzero part play a fundamental role in 
model estimation. For example, zero inflation in Poisson and binomial data 
can be determined by the lack of fit in the zero cells of these respective 
distributions. In this paper, we assume that the nonzero distribution is given 
by a generalized linear mixed model with normal random effects. The ML 
approach exploits the correlation structure in order to distinguish structural 
zeros and random zeros. Intrinsically, GEE only uses the mean structure of 
the binary data in order to estimate regression parameters and, unlike ML, 
does not use the entire distribution for estimation. 

In our application, there were very few missing data. However, in many 
studies with sensitive psychological or behavioral questions, there may be in¬ 
formative missingness. An advantage of the ML approach is that it can more 
easily be extended to account for informative missing responses [see Foll- 
mann and Wu (1995), e.g.]. The proposed methodology implicitly assumes 
that the subjects answer the questions truthfully. If this assumption does not 
hold, the parameter estimation is likely to be biased. We could formulate a 
likelihood approach if we had good prior information about the distribution 
of false negative occurrence across the questions. This would require a veri¬ 
fication subsample corresponding to the survey questions (maybe obtained 
through interviews of parents and friends) on a fraction of teenagers. 

The zero-inflation probability in our model is assumed to be constant, but 
it is straightforward to include covariates in both ML and GEE approaches. 
In addition, our focus is on a cross-sectional inference, that is, analyzing the 
dating violence data at one point in time (11th grade). Understanding be¬ 
havior change from adolescence over time is interesting but also challenging. 
In the longitudinal setting, the zero-inflation probability is time-dependent 
that can probably be modeled by a latent Markov process. We will leave it 
for future exploration. 
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SUPPLEMENTARY MATERIAL 

Supplement to “Mixed model and estimating equation approaches for 
zero inflation in clustered binary response data with application to a dating 
violence study” (DOI: 10.1214/14-AOAS791SUPP; .pdf). Supplement A: 
Additional simulation one. Examine the performance of the proposed model 
with a smaller sample size (A^ = 500). Supplement B: Additional Simulation 
Two. Examine the sensitivity of assuming a constant zero-inflation probabil¬ 
ity when the probability is affected by covariates. Supplement C: Additional 
Simulation Three. Examine the performance of zero-inflated beta-binomial 
model. 
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